Abstract

The double digest problem is a common NP-hard approach to constructing physical
maps of DNA sequences. This paper presents a new approach called the enhanced
double digest problem. Although this new problem is also NP-hard, it can be solved
in linear time in certain theoretically interesting cases.
1 Introduction

The physical mapping of DNA is a key problem in computational biology. A map of a
DNA sequence consists of the locations of some given small sequences such as GAATTC.
Biologists use such maps in a preparatory step to determine the target DNA sequence.

A common technique for constructing maps uses restriction enzymes to cut a DNA
sequence at the positions where a particular short DNA sequence appears. These positions
are called restriction sites. One approach to modeling map construction is the double
digest (DD) problem. Given two restriction enzymes A and B, this approach cuts a
target DNA sequence using enzyme A, enzyme B, and both enzymes, separately. It is a
biological fact that the restriction sites for enzymes A and B do not coincide. Throughout
this paper, we make use of this fact. Let A, B and C be the three multisets of the
lengths of the fragments formed after applying enzyme A, enzyme B and both enzymes
to the target DNA sequence, respectively. Given A, B and C, the DD problem asks for
permutations of the lengths in A and B such that if these sets of lengths are plotted on
top of one another, the lengths of all the resulting subintervals formed due to overlapping
match exactly the lengths in C. See Figure 1 for an example.
Figure 1: Strips (a), (b) and (c) show the fragments resulting from the applications of
enzyme A, enzyme B and both enzymes, respectively. In strip (c), the subfragments are
created due to the overlapping between fragments in (a) and those in (b).

Many algorithms have been proposed for the DD problem. Stefik gave
the first algorithm using artificial intelligence. Fitch, Smith and Ralph reduced the
DD problem to the set partition problem. Goldstein and Waterman approached this
problem with a stochastic annealing heuristic for the traveling salesman problem. They
also showed that the DD problem is NP-hard by reducing the set partition problem to it.

This paper suggests a new approach, called the enhanced double digest (EDD) problem.
The EDD problem uses A, B, C and some additional length information; see Section 2
for the details of the approach. Although the EDD problem is still NP-hard, we show
that if the lengths in C are all distinct, it can be solved in linear time. We also generalize
the algorithm for the case where the number of duplicates in C is bounded by a constant.
The time complexity of this generalized algorithm remains linear.

Section 2 details the new approach and defines the EDD problem formally. Section 3
gives the linear-time algorithm for the case where C is duplicate-free. It also generalizes
the algorithm to handle a small number of duplicate lengths. Section 4 proves that the
EDD problem is NP-hard. Section 5 concludes with some directions for further work.

2 Problem formulation 

Consider a target DNA sequence and two restriction enzymes A and B. 

• By applying enzyme A (respectively, B) to the target DNA sequence, we obtain $p$
(respectively, $q$) fragments. Let $A = \{a_1, \ldots, a_p\}$ (respectively, $B = \{b_1, \ldots, b_q\}$) be
the multiset of the lengths of these $p$ (respectively, $q$) fragments.

• For $i = 1, \ldots, p$, let $\bar{a}_i$ be the fragment corresponding to $a_i$. We apply enzyme B to
the fragment $\bar{a}_i$ and obtain a set of subfragments. Let $AB_i$ be the multiset of the
lengths of these subfragments.

• For $j = 1, \ldots, q$, let $\bar{b}_j$ be the fragment corresponding to $b_j$. We apply enzyme A to
the fragment $\bar{b}_j$ and obtain a set of subfragments. Let $BA_j$ be the multiset of the
lengths of these subfragments.
For the example in Figure 1, the following length information is gathered:

• $A = \{a_1 = 9, a_2 = 12, a_3 = 15, a_4 = 17, a_5 = 37\}$; $B = \{b_1 = 6, b_2 = 38, b_3 = 46\}$;

• $AB_1 = \{3, 6\}$; $AB_2 = \{12\}$; $AB_3 = \{15\}$; $AB_4 = \{17\}$; $AB_5 = \{8, 29\}$;

• $BA_1 = \{6\}$; $BA_2 = \{3, 8, 12, 15\}$; $BA_3 = \{17, 29\}$.

It is easily verified that the data found in this way has the following properties:

Fact 1.

1. For $i = 1, \ldots, p$, $a_i = \sum_{c \in AB_i} c$. For $j = 1, \ldots, q$, $b_j = \sum_{c \in BA_j} c$.

2. $\bigcup_i AB_i = \bigcup_j BA_j = C$.

3. $|C| = |A| + |B| - 1$.

Proof. Straightforward. □
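The three properties of Fact 1 can be checked mechanically on the Figure 1 data. The following is a minimal Python sketch; the dictionary layout and the name `check_fact1` are illustrative assumptions, not part of the problem definition.

```python
# Length data of the Figure 1 example, keyed by fragment index.
A = {1: 9, 2: 12, 3: 15, 4: 17, 5: 37}
B = {1: 6, 2: 38, 3: 46}
AB = {1: [3, 6], 2: [12], 3: [15], 4: [17], 5: [8, 29]}
BA = {1: [6], 2: [3, 8, 12, 15], 3: [17, 29]}

def check_fact1(A, B, AB, BA):
    """Return True if the three properties of Fact 1 hold (hypothetical helper)."""
    # Property 1: every fragment length is the sum of its subfragment lengths.
    sums_ok = (all(A[i] == sum(AB[i]) for i in A)
               and all(B[j] == sum(BA[j]) for j in B))
    # Property 2: both unions give the same multiset C.
    c_from_a = sorted(c for i in AB for c in AB[i])
    c_from_b = sorted(c for j in BA for c in BA[j])
    same_c = c_from_a == c_from_b
    # Property 3: |C| = |A| + |B| - 1.
    size_ok = len(c_from_a) == len(A) + len(B) - 1
    return sums_ok and same_c and size_ok

print(check_fact1(A, B, AB, BA))  # True for the Figure 1 data
```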

Given $A$, $B$, $AB_1, \ldots, AB_p$, $BA_1, \ldots, BA_q$, the enhanced double digest problem $V$ asks
for a valid permutation $(\pi_A, \pi_B)$ of the elements in $A$ and $B$ such that the following can
be achieved. When the fragments $\bar{a}_i$ for $a_i \in A$ and $\bar{b}_j$ for $b_j \in B$ are plotted on the same
line according to the order given by $\pi_A$ and $\pi_B$, a set of subfragments is formed due to
overlapping. The multiset $C$ of the lengths of these subfragments is required to be equal
to $\bigcup_{i=1}^{p} AB_i = \bigcup_{j=1}^{q} BA_j$. In addition,

• for every $a_i \in A$ (respectively, $b_j \in B$), $AB_i$ (respectively, $BA_j$) is equal to the
multiset of the lengths of the subfragments which overlap with $\bar{a}_i$ (respectively, $\bar{b}_j$).

Note that an instance of this problem may have no solution or more than one valid
permutation. The algorithms given in Section 3 can recover all valid permutations, if any
exists.
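The definition of a valid permutation can be made concrete with a small checker. The sketch below, in Python, plots both fragment orders on one line, collects the overlap lengths, and compares them with the given multisets. The names `overlaps` and `is_valid` and the dictionary layout are illustrative assumptions; the pairwise comparison is quadratic and is meant for clarity, not efficiency.

```python
from collections import Counter

# Length data of the Figure 1 example, keyed by fragment index.
A = {1: 9, 2: 12, 3: 15, 4: 17, 5: 37}
B = {1: 6, 2: 38, 3: 46}
AB = {1: [3, 6], 2: [12], 3: [15], 4: [17], 5: [8, 29]}
BA = {1: [6], 2: [3, 8, 12, 15], 3: [17, 29]}

def overlaps(order_a, order_b, A, B):
    """Plot both orders on one line; return the overlap lengths per fragment."""
    def intervals(order, lengths):
        pos, out = 0, []
        for k in order:
            out.append((k, pos, pos + lengths[k]))
            pos += lengths[k]
        return out
    ab = {i: [] for i in A}
    ba = {j: [] for j in B}
    for i, s1, e1 in intervals(order_a, A):
        for j, s2, e2 in intervals(order_b, B):
            w = min(e1, e2) - max(s1, s2)   # length of the common part
            if w > 0:
                ab[i].append(w)
                ba[j].append(w)
    return ab, ba

def is_valid(order_a, order_b, A, B, AB, BA):
    """True if the plotted overlaps reproduce every AB_i and BA_j as multisets."""
    ab, ba = overlaps(order_a, order_b, A, B)
    return (all(Counter(ab[i]) == Counter(AB[i]) for i in A)
            and all(Counter(ba[j]) == Counter(BA[j]) for j in B))

print(is_valid([1, 2, 3, 5, 4], [1, 2, 3], A, B, AB, BA))  # True
print(is_valid([1, 2, 3, 4, 5], [1, 2, 3], A, B, AB, BA))  # False
```

In the second call the fragment of length 17 straddles the boundary between the fragments of lengths 38 and 46, so it would be cut into 8 and 9 rather than staying whole.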



3 An efficient algorithm 

Unless otherwise stated, this section assumes that $C$ has no duplicates. Let $n = |C|$. This
section shows that the EDD problem $V$ can be solved in $O(n)$ time.

Section 3.1 formulates the EDD problem as a graph problem. Section 3.2 describes the
linear-time algorithm. Section 3.3 discusses how to generalize this linear-time algorithm
to the case where $C$ may contain a small number of duplicates.

3.1 A graph representation 

Given $A$, $B$, $AB_1, \ldots, AB_p$, $BA_1, \ldots, BA_q$, we construct an undirected graph $G$ as follows.

• The node set of $G$ is $A \cup B \cup C$.

• For every $a_i \in A$ and every $x \in C$, $(a_i, x) \in G$ if $x \in AB_i$.

• For every $b_j \in B$ and every $x \in C$, $(b_j, x) \in G$ if $x \in BA_j$.

Figure 2: The graph $G$ in (a) is constructed from the example in Figure 1. $G$ can be
redrawn into a tree as shown in (b). The superscript A, B or C of each node denotes
whether the node belongs to $A$, $B$ or $C$.

From the definition, we can observe that $G$ satisfies the following lemma.

Lemma 2. $G$ is connected. For each node in $A \cup B$, its degree is at least 1 and it is
adjacent to nodes in $C$ only. Also, every node in $C$ connects to exactly one node in $A$ and
one node in $B$.

Proof. Straightforward based on the assumption that $C$ has no duplicates. □
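As an illustration of the construction, the following Python sketch builds $G$ as adjacency lists for the Figure 1 data and checks the local part of Lemma 2. The node encoding `("A", i)`, `("B", j)`, `("C", x)` and the name `build_graph` are assumptions made for this sketch; it relies on $C$ being duplicate-free so that a length identifies a node of $C$.

```python
AB = {1: [3, 6], 2: [12], 3: [15], 4: [17], 5: [8, 29]}
BA = {1: [6], 2: [3, 8, 12, 15], 3: [17, 29]}

def build_graph(AB, BA):
    """Adjacency lists of G; an edge joins a fragment to each of its subfragments."""
    adj = {}
    for side, sets in (("A", AB), ("B", BA)):
        for k, lengths in sets.items():
            for x in lengths:
                adj.setdefault((side, k), []).append(("C", x))
                adj.setdefault(("C", x), []).append((side, k))
    return adj

adj = build_graph(AB, BA)

# Lemma 2: every node of C has exactly one neighbour in A and one in B,
# and nodes of A and B are adjacent to nodes of C only.
c_ok = all(sorted(s for s, _ in nbrs) == ["A", "B"]
           for node, nbrs in adj.items() if node[0] == "C")
ab_ok = all(all(s == "C" for s, _ in nbrs)
            for node, nbrs in adj.items() if node[0] != "C")
n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
print(c_ok, ab_ok, n_edges)  # True True 14
```

With $n = |C| = 7$, the edge count $2n = 14$ and the node count $|A| + |B| + |C| = 15$ are consistent with $G$ being a tree for this example.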

If $V$ has a valid permutation, $G$ has two more properties as stated in Lemma 3.
Figure 2 illustrates an example. A diameter of a tree is a path with the largest number
of edges. A dangler is a 2-node-long path. Given a tree $T$, a subtree $r$ of $T$ is said to be
hanged on a path $P$ in $T$ if $r$ is a tree in the forest $T - P$.

Lemma 3. If $V$ has a valid permutation, then the following statements hold.

1. $G$ is a tree.

2. For any diameter $S$ of $G$, the subtrees hanged on $S$ must be danglers.
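Proof.

Statement 1. To prove by contradiction, suppose that $G$ contains a cycle $D$. By the
construction of $G$, $D$ must be of the form

$$a_{i_1}, c_{k_1}, b_{j_1}, c_{k_2}, a_{i_2}, \ldots, a_{i_z}, c_{k_{2z-1}}, b_{j_z}, c_{k_{2z}}, a_{i_{z+1}},$$

where $i_1 = i_{z+1}$; $a_{i_1}, \ldots, a_{i_z} \in A$; $b_{j_1}, \ldots, b_{j_z} \in B$; and $c_{k_1}, \ldots, c_{k_{2z}} \in C$.

By definition, if $a_i, c_k, b_j$ is a path in $G$, then $\bar{a}_i$ and $\bar{b}_j$ overlap by $c_k$ in any valid
permutation of $V$. Thus, for $1 \le \ell \le z - 1$, the existence of the subpath $a_{i_\ell}, \ldots, a_{i_{\ell+2}}$ of $D$
in $G$ means that $\bar{b}_{j_\ell}$ overlaps with $\bar{a}_{i_\ell}$ and $\bar{a}_{i_{\ell+1}}$, and $\bar{b}_{j_{\ell+1}}$ overlaps with $\bar{a}_{i_{\ell+1}}$ and $\bar{a}_{i_{\ell+2}}$. To
enable both $\bar{b}_{j_\ell}$ and $\bar{b}_{j_{\ell+1}}$ to overlap with $\bar{a}_{i_{\ell+1}}$, $\bar{a}_{i_{\ell+1}}$ must be in the middle of $\bar{a}_{i_\ell}$ and $\bar{a}_{i_{\ell+2}}$,
$1 \le \ell \le z - 1$. Consequently, for $1 < \ell \le z$, $\bar{a}_{i_\ell}$ is in the middle of $\bar{a}_{i_1}$ and $\bar{a}_{i_{z+1}} = \bar{a}_{i_1}$,
which is impossible.

Figure 3: In this example, all $a_i \in A$, $b_j \in B$ and $c_k \in C$.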

Statement 2. For any diameter $S$ of $G$, we show that every subtree $r$ hanged on $S$
must be a dangler. First, $r$ must be hanged on $S$ at a node in $A \cup B$. Otherwise, if $r$ is
hanged on $S$ at a node $c \in C$, $c$ has degree greater than 2, contradicting Lemma 2. Then,
$r$ has more than one node because the root of $r$ is a node in $C$ and must be of degree 2.
If $r$ cannot have more than 2 nodes, Statement 2 follows.

To prove by contradiction, suppose that $r$ has more than two nodes. Without loss of
generality, assume that $r$ is hanged on $S$ at a node $a_i \in A$ and the root of $r$ is a node
$c_{k_3} \in C$. Note that $c_{k_3}$ has another neighbour, say $b_{j_3}$, from $B$. If $r$ contains more than
two nodes, $b_{j_3}$ must have a child, say $c'_{k_3}$, from $C$ and $c'_{k_3}$ must have a child, say $a_{i_3}$, from
$A$. Thus, $r$ must have a root-to-leaf path of length more than 4. Then, the two paths
from $a_i$ to both ends of $S$ must be of length more than 4. Otherwise, $S$ cannot be a
diameter of $G$. From those observations, $G$ has the pattern shown in Figure 3. According
to the pattern, $\bar{b}_{j_1}$, $\bar{b}_{j_2}$ and $\bar{b}_{j_3}$ overlap with $\bar{a}_i$. Therefore, in any valid permutation, one
of $\bar{b}_{j_1}$, $\bar{b}_{j_2}$ and $\bar{b}_{j_3}$, say $\bar{b}_{j_m}$, must be in the middle of the other two fragments and $\bar{b}_{j_m}$ can
only overlap with $\bar{a}_i$. However, according to the pattern in Figure 3, for $\ell = 1, 2, 3$, $\bar{b}_{j_\ell}$
overlaps with another fragment $\bar{a}_{i_\ell}$, reaching a contradiction. □

Now, we know that if $V$ has a valid permutation, $G$ satisfies the two properties of
Lemma 3. The remainder of this section shows that the converse of this statement is also
true. Suppose that $G$ is a tree with a diameter $S$ such that all the subtrees hanged on $S$
are danglers. We define $\pi_C$ to be a permutation on $C$ formed by a search defined below.

Dangler-first search: Traverse $G$ starting from one end of $S$ to the other end of $S$; read
off the nodes in $C$ on $S$; whenever we meet a node $x$ with degree greater than 2, read
off the nodes in $C$ in the danglers hanged on $S$ at $x$ in any order and continue to
traverse $S$.

Lemma 4. The elements in each $AB_i$ form a consecutive subsequence in $\pi_C$. Similarly,
the elements in each $BA_j$ form a consecutive subsequence in $\pi_C$.

Proof. For each $i$, if $AB_i$ contains only one element, then the lemma follows. Otherwise,
$a_i$ is of degree at least 2. Then, $a_i$ must be on the diameter $S$. Let $c$ and $c'$ be elements
in $AB_i$ which are the two neighbours of $a_i$ on $S$. The remaining nodes in $AB_i$ must be
located in the danglers hanged on $S$ at $a_i$. By dangler-first search, all the elements in
$AB_i$ must form a consecutive subsequence in $\pi_C$. By symmetry, for each $j$, the elements
in $BA_j$ must form a consecutive subsequence in $\pi_C$. □

By Lemma 4, $\pi_C$ can be partitioned into $p$ subintervals such that the $r$th interval
contains the elements in $AB_{i_r}$ for $r = 1, \ldots, p$. Let $\pi_A$ be the permutation $(a_{i_1}, \ldots, a_{i_p})$.
Similarly, $\pi_C$ can be partitioned into $q$ intervals such that the $s$th interval contains the
elements in $BA_{j_s}$ for $s = 1, \ldots, q$. Let $\pi_B$ be the permutation $(b_{j_1}, \ldots, b_{j_q})$. We call
$(\pi_A, \pi_B)$ the induced permutation of $\pi_C$.

Lemma 5. The induced permutation $(\pi_A, \pi_B)$ of $\pi_C$ is a valid permutation of $V$.

Proof. Suppose the lengths from $A$, $B$ and $C$ are plotted on the same line according to the
order given by $\pi_A$, $\pi_B$ and $\pi_C$, respectively. Consider the stripes formed from $A$ and $C$.
By Fact 1 and Lemma 4, for each $i$, $\bar{a}_i$ overlaps with $c$ for all $c \in AB_i$. By symmetry, for
each $j$, $\bar{b}_j$ overlaps with $c$ for all $c \in BA_j$. Then, by the definition of the EDD problem,
$(\pi_A, \pi_B)$ is a valid permutation. □

Theorem 6. Given the enhanced double digest problem $V$ and its corresponding graph
$G$, $V$ has a valid permutation if and only if $G$ satisfies the two properties in Lemma 3.

Proof. The only-if part follows from Lemma 3. The if part follows from Lemma 5. □

3.2 A linear-time algorithm for a duplicate-free C 

This section describes how to compute a valid permutation of $V$ in $O(n)$ time. The
algorithm is as follows.

Algorithm Enhanced-Double-Digest

1. Construct the graph $G$ corresponding to $V$.

2. If $G$ does not satisfy the two properties in Lemma 3, then return "no valid permutation".

3. Find the permutation $\pi_C$ using dangler-first search.

4. Find the induced permutation $(\pi_A, \pi_B)$ of $\pi_C$.

5. Return $(\pi_A, \pi_B)$.
Lemma 7. Algorithm Enhanced-Double-Digest can correctly find a valid permutation in
$O(n)$ time.

Proof. First, by Lemma 5 and Theorem 6, Enhanced-Double-Digest is correct. As for its
time complexity, Step 1 requires $O(n)$ time as $G$ contains $2n$ edges and we can find each
edge in $O(1)$ time. Step 2 checks whether $G$ satisfies the two properties in Lemma 3. For
property 1, we can determine whether a graph is a tree in $O(n)$ time. For property 2, we
can first compute a diameter of a tree in linear time; then, we verify whether $G$ satisfies
property 2 by detecting whether the subtrees hanged on the diameter are danglers. Thus,
Step 2 requires $O(n)$ time. Step 3 finds $\pi_C$ using dangler-first search. Since the search
scans every node in $G$ once, it runs in $O(n)$ time. Step 4 finds the induced permutation
$(\pi_A, \pi_B)$ of $\pi_C$ in $O(n)$ time. In summary, a valid permutation of $V$ can be computed in
$O(n)$ time. □
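The five steps of Algorithm Enhanced-Double-Digest can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the input is given as dictionaries `AB` and `BA` (hypothetical names) mapping fragment indices to lists of subfragment lengths, the input is non-empty, $C$ is duplicate-free, Fact 1 holds, and the diameter is found with the standard double breadth-first search on a tree.

```python
from collections import deque

def build_graph(AB, BA):
    """Step 1: adjacency lists; nodes are ("A", i), ("B", j), ("C", x)."""
    adj = {}
    for side, sets in (("A", AB), ("B", BA)):
        for k, lengths in sets.items():
            for x in lengths:
                adj.setdefault((side, k), []).append(("C", x))
                adj.setdefault(("C", x), []).append((side, k))
    return adj

def bfs(adj, src):
    """Return the last node reached by BFS from src, and the parent map."""
    parent, queue, last = {src: None}, deque([src]), src
    while queue:
        last = queue.popleft()
        for w in adj[last]:
            if w not in parent:
                parent[w] = last
                queue.append(w)
    return last, parent

def enhanced_double_digest(AB, BA):
    """Return (pi_A, pi_B, pi_C) as index/length lists, or None if no solution."""
    adj = build_graph(AB, BA)
    # Lemma 2: each C-node has exactly one A-neighbour and one B-neighbour.
    for node, nbrs in adj.items():
        if node[0] == "C" and sorted(s for s, _ in nbrs) != ["A", "B"]:
            return None
    # Step 2, property 1: G is a tree (connected with |V| - 1 edges).
    end1, seen = bfs(adj, next(iter(adj)))
    if len(seen) != len(adj) or sum(map(len, adj.values())) != 2 * (len(adj) - 1):
        return None
    # A diameter S: the path between end1 and the node farthest from it.
    end2, parent = bfs(adj, end1)
    path, node = [], end2
    while node is not None:
        path.append(node)
        node = parent[node]
    on_path = set(path)
    # Step 3: dangler-first search; also checks property 2 on the way.
    pi_c = []
    for node in path:
        if node[0] == "C":
            pi_c.append(node[1])
        for c in adj[node]:
            if c in on_path:
                continue
            leaf = next(w for w in adj[c] if w != node)
            if len(adj[leaf]) != 1:      # hanged subtree is not a dangler
                return None
            pi_c.append(c[1])
    # Step 4: induced permutation = order in which owners first appear in pi_C.
    def induced(side):
        order = []
        for x in pi_c:
            owner = next(k for s, k in adj[("C", x)] if s == side)
            if not order or order[-1] != owner:
                order.append(owner)
        return order
    return induced("A"), induced("B"), pi_c   # Step 5

AB = {1: [3, 6], 2: [12], 3: [15], 4: [17], 5: [8, 29]}
BA = {1: [6], 2: [3, 8, 12, 15], 3: [17, 29]}
print(enhanced_double_digest(AB, BA))
```

For the Figure 1 data the returned orders correspond, up to reversal of the line, to the fragment lengths 9, 12, 15, 37, 17 for $A$ and 6, 38, 46 for $B$. Every node and edge is visited a constant number of times, which mirrors the $O(n)$ bound argued in the proof.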

By modifying Algorithm Enhanced-Double-Digest slightly, we can report all valid
permutations of $V$. First, observe that the valid permutations of $V$ depend on the possible
permutations $\pi_C$. There are three cases.

Case 1: $G$ does not have any dangler. Then, there is a unique $\pi_C$. Thus, the current
algorithm reports all valid permutations of $V$.

Case 2: $G$ has one set of danglers hanged on one node of its diameter. Then, the
possible permutations $\pi_C$ depend on the permutation of the set of nodes in the danglers
which belong to $C$. For the example in Figure 2, the possible permutations $\pi_C$ can be
represented by

6, 3, permutation(12, 15), 8, 29, 17.

All valid permutations $\pi_A$ and $\pi_B$ can be represented by 9, permutation(12, 15), 37, 17 and
6, 38, 46, respectively. These valid permutations can be reported by modifying Steps 3
and 4 of the algorithm. The time complexity of the modified algorithm is still $O(n)$.

Case 3: $G$ has $k$ sets of danglers hanged on $k$ respective nodes of its diameter. Then,
the possible permutations $\pi_C$ can be represented similarly, except that each $\pi_C$ contains
$k$ permutation blocks. The above modified algorithm is sufficient to report all valid
permutations of $V$.

3.3 A general algorithm for C with few duplicates 

The algorithm Enhanced-Double-Digest in Section 3.2 can solve the EDD problem if $C$
contains no duplicates. Here, we give an algorithm which works without this assumption.
First, we consider the following example.

• $A = \{a_1 = 18, a_2 = 19\}$; $B = \{b_1 = 4, b_2 = 5, b_3 = 7, b_4 = 8, b_5 = 13\}$;

• $AB_1 = \{5, 6, 7\}$; $AB_2 = \{4, 7, 8\}$;

• $BA_1 = \{4\}$; $BA_2 = \{5\}$; $BA_3 = \{7\}$; $BA_4 = \{8\}$; $BA_5 = \{6, 7\}$.

In this example, there are two 7's in $C = \bigcup_i AB_i = \bigcup_j BA_j$. These two 7's in fact
represent two different subfragments in the target DNA sequence. To distinguish them,
let the copy of 7 in $AB_1$ be $7_1$ and that in $AB_2$ be $7_2$. Since 7 also belongs to $BA_3$
and $BA_5$, there are two possible combinations, namely, (a) $7_1 \in BA_5$ and $7_2 \in BA_3$
and (b) $7_1 \in BA_3$ and $7_2 \in BA_5$. Figures 4(a) and 4(b) illustrate the graph $G$ for both
cases. In combination (a), both 6 and $7_1$ are adjacent to $a_1$ and $b_5$, so $G$ contains a
cycle and, by Lemma 3, no valid permutation exists; in combination (b), $G$ is a tree
satisfying Lemma 3, from which we can obtain a valid permutation.
Therefore, we can handle duplicates in $C$ by giving them different subscripts. Then,
all the elements in $C$ are different and we can solve the enhanced double digest problem
using the algorithm Enhanced-Double-Digest in Section 3.2. More precisely, we have the
following algorithm.

Figure 4: (a) is the case where $7_1 \in BA_5$ and $7_2 \in BA_3$; (b) is the case where $7_1 \in BA_3$
and $7_2 \in BA_5$.

1. If $C$ contains duplicates, then we assign a unique subscript to each duplicate.

2. For each possible combination of the subscripts in the duplicates, we execute
Enhanced-Double-Digest to compute a valid permutation.

Let $\ell$ be the number of duplicates in $C$. The above algorithm executes
Enhanced-Double-Digest at most $\ell!$ times. Therefore, a valid permutation can be computed in
$O(\ell!\, n)$ time. Thus, if $\ell$ is constant, the generalized algorithm still runs in linear time.
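The enumeration of subscript combinations can be illustrated on the example with two 7's. The Python sketch below handles a single duplicated value only and, to stay short, tests just the tree property of Lemma 3 with a union-find structure instead of running the full algorithm; the names `labelled_instances`, `is_tree` and `BA_slots` are hypothetical. The copies on the $A$ side are labelled as in the text: `(7, 1)` in $AB_1$ and `(7, 2)` in $AB_2$.

```python
from itertools import permutations

AB = {1: [5, 6, (7, 1)], 2: [4, (7, 2), 8]}
# On the B side the subscripts of the two 7's are not known in advance.
BA_slots = {1: [4], 2: [5], 3: [7], 4: [8], 5: [6, 7]}

def labelled_instances(BA_slots, value, copies):
    """Yield BA with each occurrence of `value` relabelled (value, s),
    over all assignments of the subscripts 1..copies to the occurrences."""
    for subs in permutations(range(1, copies + 1)):
        it = iter(subs)
        yield {j: [(x, next(it)) if x == value else x for x in xs]
               for j, xs in BA_slots.items()}

def is_tree(AB, BA):
    """True if the graph G built from AB and BA is acyclic and connected."""
    edges = [(("A", i), x) for i in AB for x in AB[i]]
    edges += [(("B", j), x) for j in BA for x in BA[j]]
    nodes = {u for e in edges for u in e}
    parent = {u: u for u in nodes}
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False                    # this edge closes a cycle
        parent[ru] = rv
    return len(edges) == len(nodes) - 1     # acyclic + this count => connected

for BA in labelled_instances(BA_slots, 7, 2):
    print(BA[3], BA[5], is_tree(AB, BA))
```

Of the $2! = 2$ labellings, only the one that places the copy from $AB_1$ in $BA_3$ and the copy from $AB_2$ in $BA_5$ yields a tree; the other one creates a cycle through $a_1$ and $b_5$.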

4 The enhanced double digest problem is NP-hard 

This section proves the NP-hardness of the enhanced double digest problem by a reduction
from the Hamiltonian Path problem.

Given an undirected graph $H$, we show that in polynomial time, we can construct
an EDD instance $Q$ so that $H$ contains a hamiltonian path if and only if $Q$ has a valid
permutation. For ease of proof, we augment $H$ with two new nodes $t$ and $z$. All nodes
originally in $H$ have edges to $t$. In addition, we add an edge $(t, z)$ to $H$. Note that the
original $H$ contains a hamiltonian path if and only if the amended $H$ has a hamiltonian
path. Let $\ell$ be the number of nodes in $H$. Assume that the nodes in $H$ are labeled by
$\{1, 2, \ldots, \ell\}$. For each node $v$, let $\kappa(v)$ be the number of neighbours of $v$. Let $v' = v + \ell$.
The EDD instance $Q$ is given the following length information. Note that this length
information can be constructed from $H$ in polynomial time.

• $A = \{a_v \mid v \in H\}$ where $a_z = t'$, $a_t = t + \sum_{u \in H - \{t,z\}} u'$ and $a_v = v + \sum_{(u,v) \in H} u'$ for
$v \neq z, t$. Also, $AB_z = \{t'\}$; $AB_t = \{u' \mid u \in H - \{t,z\}\} \cup \{t\}$; and $AB_v = \{u' \mid
(u, v) \in H\} \cup \{v\}$ for $v \neq z, t$.

• $B = \{b_v, b_{v(1)}, \ldots, b_{v(\kappa(v)-1)} \mid v \in H - \{z\}\}$ where $b_v = v + v'$ and $b_{v(i)} = v'$ for all
$v \in H - \{z\}$ and all $1 \le i \le \kappa(v) - 1$. Also, $BA_v = \{v, v'\}$ and $BA_{v(i)} = \{v'\}$.



Lemma 8. $H$ has a hamiltonian path if and only if there is a valid permutation for $Q$.

Proof. The two directions are proved as follows.

($\Rightarrow$) Let $u_1, u_2, \ldots, u_{\ell-2}, t, z$ be a hamiltonian path in $H$. Let $\pi_A$ and $\pi_B$ be
permutations of $A$ and $B$ as shown in Figure 5. It is easy to check that $(\pi_A, \pi_B)$ is a valid
permutation of $Q$.

Figure 5: The permutations $\pi_A$ and $\pi_B$ of $A$ and $B$, respectively.

($\Leftarrow$) Let $(\pi_A, \pi_B)$ be a valid permutation of $Q$. The remainder of this proof shows
that the ordering of the lengths in $\pi_A$ defines a hamiltonian path in $H$.

Assume the lengths from $A$ are plotted on a line according to the order given by $\pi_A$
and similarly, the lengths from $B$ are also plotted on this line according to $\pi_B$. For each
$v \in H$, the line fragment corresponding to $a_v \in A$ is called $\bar{a}_v$. For each $v \in H - \{z\}$, the
line fragment corresponding to $b_v \in B$ is called $\bar{b}_v$.

For every $v \in H - \{z\}$, since $BA_v = \{v, v'\}$, $\bar{b}_v$ overlaps with two consecutive line
fragments from $A$; in addition, the overlapping regions between $\bar{b}_v$ and these two line
fragments must be of length $v$ and $v'$, respectively. Observe that $v \in AB_v$ and $v \notin AB_u$
for all $u \neq v$. One of these two fragments, which overlaps with $\bar{b}_v$, must be $\bar{a}_v$. The other
line fragment can be $\bar{a}_u$ for any $u \in H$ with $v' \in AB_u$, i.e., $(v, u) \in H$.

Let $\pi_A = (a_{u_1}, \ldots, a_{u_\ell})$. From the above argument, we know that, for every two
consecutive line fragments $\bar{a}_{u_i}$ and $\bar{a}_{u_{i+1}}$, there exists a fragment $\bar{b}_v$ (where $v$ is either $u_i$
or $u_{i+1}$) which overlaps with both $\bar{a}_{u_i}$ and $\bar{a}_{u_{i+1}}$. The above argument also implies that
$(u_i, u_{i+1}) \in H$. Thus, $u_1, \ldots, u_\ell$ forms a path in $H$. As $u_1, \ldots, u_\ell$ contains all the $\ell$ nodes
of $H$, this path is a hamiltonian path. □
5 Further research directions



This highly theoretical work can be extended in several directions. One direction is
to design a series of laboratory procedures that can actually produce the input length
information in the required form. Another direction is to consider the problem with more
than two digesting enzymes. Using multiple enzymes could help resolve the issue of multiple
solutions that arises when there are danglers or duplicate subfragment lengths. Also, the
extra input may actually make the problem solvable in a shorter period of time. A
third direction is a probabilistic analysis of the number of duplicates in C when
the length of the target DNA sequence is given. It would be most meaningful to
conduct such an analysis under a probabilistic model that is derived specifically for feasible
laboratory procedures. Lastly, this paper does not address the issue of noise in the length
data. From a practical point of view, handling noise effectively is very important.